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Abstract 

Many neuroimaging studies have collected ultra-high dimensional imaging data in 
order to identify imaging biomarkers that are related to normal biological processes, 
diseases, and the response to treatment, among many others. These imaging data are 
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often represented in the form of a multi-dimensional array, called a tensor. Existing 
statistical methods are insufficient for analysis of these tensor data due to their ultra- 
high dimensionality as well as their complex structure. The aim of this paper is 
develop a tensor partition regression modeling (TERM) framework to establish an 
association between low-dimensional clinical outcomes and ultra-high dimensional 
tensor covariates. Our TPRM is a hierarchical model and efficiently integrates four 
components: (i) a partition model; (ii) a canonical polyadic decomposition model; 

(iii) a factor model; and (iv) a generalized linear model. Under this framework, 
ultra-high dimensionality is not only reduced to a manageable level, resulting in 
efficient estimation, but also prediction accuracy is optimized to search for informative 
sub-tensors. Posterior computation proceeds via an efficient Markov chain Monte 
Carlo algorithm. Simulation shows that TPRM outperforms several other competing 
methods. We apply TPRM to predict disease status (Alzheimer versus control) by 
using structural magnetic resonance imaging data obtained from Alzheimer’s Disease 
Neuroimaging Initiative (ADNI) study. 

Keywords: Big data; ADNI; Bayesian hierarchical model; Tensor decomposition; Tensor 

regression; Chib augmentation method; MCMC. 
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1 Introduction 


Many neuroimaging studies have collected ultra-high dimensional imaging data in order to 
identify imaging biomarkers that are related to normal biological processes, diseases, and 
the response to treatment, among many others. The imaging data provided by these studies 
are often represented in the form of a multi-dimensional array, called a tensor. Existing 
statistical methods are insufficient for analysis of these tensor data due to their ultra-high 
dimensionality as well as their complex structure. 

The aim of this paper is to develop a novel tensor partition regression modeling (TPRM) 
framework to use high-dimensional imaging data, denoted by x, to predict a scalar response, 
denoted by y. The scalar response y may include cognitive outcome, disease status, and 
the early onset of disease, among others. In various neuroimaging studies, imaging data 
are often measured at a large number of grid points in a three (or higher) dimensional 
space and have a multi-dimensional tensor structure. Without loss of generality, we use 
X = G denote an order D tensor, where D > 2. Vectorizing x 

leads to a (n^=i Jk) x 1 vector. Examples of x include magnetic resonance imaging (MRI), 
diffusion tensor imaging (DTI), and positron emission tomography (PET), among many 
others. These advanced medical imaging technologies are essential to understanding the 
neural development of neuropsychiatric and neurodegenerative disorders and normal brain 
development. 

Although a large family of regression methods have been developed for supervised learn¬ 
ing (Hastie et ah, 2009 Breiman et al., 1984 Friedman, 1991 Zhang and Singer, 2010), 
their computability and theoretical guarantee are compromised by this ultra-high dimen¬ 
sionality of imaging data. The hrst set of promising solutions is high-dimensional sparse 
regression (HSR) models, which often take high-dimensional imaging data as unstructured 
predictors. A key assumption of HSR is its sparse solutions. HSRs not only suffer from 


diverging spectra and noise accumulation in ultra-high dimensional feature spaces (Fan and 


Fan, 2008 Bickel and Levina, 2004), but also their sparse solutions may lack biological in¬ 


terpretation in neuroimaging studies. Moreover, standard HSRs ignore the inherent spatial 
structure of the image that possesses a wealth of spatial information, such as spatial cor¬ 
relation and spatial smoothness. To address some limitations of HSRs, a family of tensor 
regression models has been developed to preserve the spatial structure of imaging tensor 


data, while achieving substantial dimensional reduction (Zhou et ah, 2013). 

The second set of solutions adopts functional linear regression (FLR) approaches, which 
treat imaging data as functional predictors. However, since most existing FLR models focus 


on one dimensional curves (Muller and Yao, 2008 Ramsay and Silverman, 2005), general- 
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izations to two and higher dimensional images is far from trivial and reqnires snbstantial 


research (Reiss and Ogden, 2010). Most estimation approaches of FLR approximate the 
coefficient fnnction of snch fnnctional regression models as a linear combination of a set 
of hxed (or data-driven) basis functions. For instance, most estimation methods of FLR 
based on the hxed basis functions (e.g., tensor product wavelet) are required to solve an 
ultra-high dimensional optimization problem and suffer the same limitations as those of 
HSR. 

The third set of solutions usually integrates supervised (or unsupervised) dimension 
reduction techniques with various standard regression models. Given the ultra-high dimen¬ 
sion of imaging data, however, it is imperative to use some dimension reduction methods 
to extract and select ‘low-dimensional’ important features, while eliminating most redun¬ 


dant features (Johnstone and Lu, 2009; Bair et al., 2006 Fan and Fan, 2008; Tibshirani 


et ah, 2002 Krishnan et ah, 2011). Most of these methods hrst carry out an unsupervised 


dimension reduction step, often by principal component analysis (PCA), and then £t a 
regression model based on the top principal components (Caffo et ah, 2010). Recently, 
for ultra-high tensor data, unsupervised higher order tensor decompositions (e.g. parallel 
factor analysis and Tucker) have been extensively proposed to extract important informa¬ 


tion of neuroimaging data (Martinez et ah, 2004 Beckmann and Smith, 2005 Zhou et al. 


2013). Although it is intuitive and easy to implement such methods, it is well known that 


the features extracted from PCA and Tucker can be irrelevant to the response. 

In this paper, we develop a novel TPRM to establish an association between imaging 
tensor predictors and clinical outcomes. Our TPRM is a hierarchical model with four 
components: (i) a partition model that divides the high-dimensional tensor covariates 
into sub-tensor covariates; (ii) a canonical polyadic decomposition model that reduces the 
sub-tensor covariates to low-dimensional feature vectors; (hi) a generalized linear model 
that uses the feature vectors to predict clinical outcomes; (iv) a sparse inducing normal 
mixture prior is used to select informative feature vectors. Although the four components 
of TPRM have been independently developed/used in different settings, the key novelty of 
TPRM lies in the integration of (i)-(iv) into a single framework for imaging prediction. In 
particular, the hrst two components (i) and (ii) are designed to specifically address the three 
key features of neuroimaging data: relatively low signal to noise ratio, spatially clustered 
ehect regions, and the tensor structure of imaging data. The neuroimaging data are often 
very noisy, while the ‘activated’ (or ‘ehect’) brain regions associated with the response 
are usually clustered together and their size can be very small. In contrast, a crucial 
assumption for the success of most matrix/array decomposition methods (e.g., singular 
value decomposition) is that the leading components obtained from these decomposition 
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methods capture the most important feature of a multi-dimensional array. Under TPRM, 
the ultra-high dimensionality of imaging data is dramatically reduced by using the partition 
model. For instance, let’s consider a standard 256 x 256 x 256 3D array with 16,777,216 
voxels, and its partition model with 32^ = 32, 768 sub-arrays with size 8x8x8. If we 
reduce each 8x8x8 into a small number of components by using component (ii), then 
the total number of reduced features is around O(IO^). We can further increase the size 
of each subarray in order to reduce the size of neuroimaging data to a manageable level, 
resulting in efficient estimation. 

The rest of the article is organized as it follows. In Section we introduce TPRM, 
the priors, and a Bayesian estimation procedure. In Section we use simulated data to 
compare the Bayesian decomposition with several competing methods. In Section we 
apply our model to the ADNI data set. In Section]^ we present some concluding remarks. 


2 Methodology 


2.1 Preliminaries 


We review several basic facts of tensors (Kolda and Bader, 2009). A tensor x = 


is a multidimensional array, whose order D is determined by its dimension. For instance, 
a vector is a tensor of order 1 and a matrix is a tensor of order 2. 

The inner product between two tensors A = and y = (2/ji...jo) 

is the sum of the product of their entries given by 


Ji Jp 

(A,y) = ^ ... ^ ^ji-.-jpUji-.-jp- 
ii=i iD=i 


The outer product between two vectors a^) = G and = (aj.f) G is a 

matrix M = of size Ji x J 2 with entries rrij-^j^ = A tensor A G 

is a rank one tensor if it can be written as an outer product of D vectors such that 
A = ad) o ad)... o a)^), where a^) G for /c = 1, ■ ■ ■ , D. Moreover, the parallel factor 
analysis, also known as PARAFAC or CP decomposition, factorizes a tensor into a sum 
of rank-one tensors such that 

A^f:A,ad)oaf o...oap), (1) 

r=l 

where a^) = G for /c = 1,..., D and r = 1,..., R. See FigureHfor an illustration 
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of a 3D array. 
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Figure 1: Figure copied from (Kolda and Bader, 2009). Panel (a) illustrates the CP 


decomposition of a three way array as a sum of R components of rank-one tensors, i.e. 


~ CLrObrO Cr 


We need the following notation throughout the paper. Suppose that we observe data 
{(j/j, Xi, Zi) : i = 1,, N} from n subjects, where the Xi’s are tensor imaging data, z* is a 
Pz x 1 vector of scalar covariates, and pi is a scalar response, such as diagnostic status or 
clinical outcome. If we concatenate all D dimensional tensor XiS into a (D-l-1) dimensional 
tensor X = {Xi,i = 1,... ,N} = we consider the CP decomposition of X as 

follows: 




or 


X 




R 

r=l 


( 1 ) ( 2 ) 
^jir^j2r 




( 2 ) 


where A = diag(Ai, • • ■ , Xr), for d = 1,..., D, and L = {lir). The 

matrices A^'^^’s and L are called factor matrices. 


2.2 Tensor Partition Regression Models 

Our interest is to develop TPRM for establishing the association between responses y and 
their corresponding imaging covariates X and clinical covariates Z. The hrst component 
of TPRM is a partition model that divides the high-dimensional tensor X into S disjoint 
sub-tensor covariates X^^\ that is 

A = and U A^"') = 0. (3) 

Although the size of A^*^ can vary across s, it is assumed that without loss of generality, 
g j^pix...xpD gjgg Qf ^(s) ig homogeneous such that S = Ilk=i{Jk/Pk)- 

The second component of TPRM is a canonical polyadic decomposition model that 
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reduces the sub-tensor covariates to low-dimensional feature vectors. Specifically, it is 
assumed that for each s, we have 




\K,-,A'P,AfAf\L.\\+e^ 




(4) 


where As — diag(A5“\ ■ ■ ■ , A^*) eonsists of the weights for eaeh rank of the decomposition in 
(§, A<f e ^PdxR Qj-Q factor matrices along the d-th dimension of df, and Lg G 
is the factor matrix along the subject dimension. It is assumed that the elements of 
are measurement errors and ~ -^(0, The elements of Lg 

capture the major variation in due to subject differences, while the common structure 


among the subjects is absorbed into the factor matrices for d = 1,... ,D (Kolda and 


Bader, 2009). 


There are two key advantages of using ([^ and Q. First, the use of the partition model 
allows us to concentrate on the most important local features of each sub-tensor, instead 
of the major variation of the whole image, which may be unassociated with the response of 
interest. In many applications, although the effect regions associated with responses may be 
relatively small compared with the whole image, their size can be comparable with that of 
each sub-tensor. Therefore, one can extract more informative features associated with the 
response with a high probability. Second, the use of the canonical polyadic decomposition 
model Q can substantially reduce the dimension of the original imaging data. Recall the 
discussions in Section 1 that the use of 8 x 8 x 8 sub-tensors can substantially reduce 
imaging size at a scale of O(IO^). 

The third component of TPRM is a factor model that decomposes L = [Li,... ,Ls] 
into the product 

L = GD + ^, (5) 


where each row of the matrix G is a iF x 1 vector of common unobserved (latent) factors gp 
D G correspond to the matrix of K latent basis functions used to represent L] 

is a matrix representing idiosyncratic errors. The proposed factor model not only reduces 
the feature vector into a manageable level but it also deals with the multicolinearity that 
may exist between features of the same partition. 

The fourth component of TPRM is a generalized linear model that links scalar re¬ 
sponses Hi and their corresponding reduced imaging features and clinical covariates 2 j. 
Specihcally, yt given and 2 , follows an exponential family distribution with density given 
by 

f{yi\6i) = m{yi) exp{r]{ei)T{yi) - a(0i)}, (6) 
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where m(-), ri{-), T(-), and a(-) are pre-specified functions. Moreover, it is assumed that 
Hi = E{yi\g^,Zi) satishes 

h(/ii) = 2 f 7 + c/fh, (7) 

where 7 and b are coefficient vectors associated with Zi and respectively and h(-) is a 
link function. 


2.3 Prior Distributions 


We consider the priors on the elements of b^. Bimodal sparsity promoting priors are key 
elements to perform variable selection and have been the subject of extensive research 


(Mayrink and Lucas, 2013; George and McCulloch, 1993, 1997). We assume the following 
hierarchy: 


bj,\6k,a^ ~ (l-4)F(&fc) + 4N(0,a2), 

hfclvr ~ Bernoulli(7r) and vr ~ Beta(Q;o 7 r, Oitt), 


( 8 ) 


where F{-) is a pre-specihed probability distribution. A common choice of F{-) is a degen¬ 
erate distribution at 0, leading to what is called the ‘spike and slab’ prior ([Mitchell and 


Beauchamp, 1988). A different approach is to consider F = N(0,e) with a very small e 
instead of putting a probability mass on bk = 0. Thus, the b^s are assumed to come from 
a mixture of two normal distributions. In this case, the hyperparameter should be large 
enough to give support to values of the coefficients that are substantively different from 0 , 
but not so large that unrealistic values of b^ are supported. In this article, we opt for the 
latter approach. 

The probability tt determines whether a particular component of Qi is informative for 
predicting y. A common choice for its prior is a non-informative distribution with aon = 
ttiTT = 1. However, this choice of the hyperparameters implies that its posterior mean is 
restricted to the interval [1/3, 2/3], a undesirable feature in variable selection. To fix this, 
we choose a ‘bathtub’ shaped beta distribution, since a prior concentrating most of its 
mass in the extremes of the interval ( 0 , 1 ) is evidently more suitable for variable selection 


(Gongalves et ah, 2013). 


We consider the priors on the elements of and For <7=1, 


(s)r’ 


and r = 1 ,..., i?, we assume 


4G) 

^{s)r 


T 


~ Gamma(z/or, ffir), and ~ N( 0 , 




( 9 ) 
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For (7, ti), it is assumed that 7 ~ N(7*,u“^/g) and v ~ Gamma(i/ou) 

For the elements of the factor model, we let G = ... ^Qk}-! D = {di,..., dp^} and 

the elements of 'ijjij ~ N (|o, for i = 1,..., n and j = 1,..., Pl- For each fixed 
A; = 1 ,..., iF, it is assumed that 


^fc~N( 0 ,n dfc~N( 0 ,iF and ~ Gamma(/3orv,, 


where Ik he a, k x k identity matrix. When pd is large, the columns of the factor matrix 
^(s)r approximately orthogonal, which is consistent with their role in the decomposition 
([^. When n and K are large, the matrices G and D are approximately orthogonal to each 
other. However, we only impose that the columns of these matrices span the space of the 
principal vectors, without explicitly requiring orthonormality, which leads substantially 


computational efficiency (Xinghao Ding and Garin, 2011). 


2.4 Posterior Inference 


Let 0 = ..., X, A, T, G, F>, 7 /,, .B, <5, TT, 7 , n}, where ... , 


1 ( d ) 4 { d )1 

qi)’ • • • > ^(5)J 

L = [Li, ..., Ls], A = [Ai,. .., A5], and r = ..., A Gibbs sampler algorithm 

is used to generate a sequence of random observations from the joint posterior distribution 
given by 


p{6\,A, y) oc p{y\e, X) p{A^^^ ..., A^^\ L, A, t\X) (10) 

p{G, D, t^\L) p{B\y, G, <5) p(< 5 | 7 r) p( 7 r)p( 7 |n) p{v). 

The Gibbs sampler essentially involves sampling from a series of conditional distribu¬ 
tions, while each of the modeling components is updated in turn. The detailed sampling 
algorithm is described on Appendix 

3 Simulation Study 

We carried out three sets of simulations to examine the hnite-sample performance of TPRM 
and its associated Gibbs sampler algorithm. 

3.1 Bayesian tensor decomposition 

The goals of the hrst set of simulations are (i) to compare the proposed Bayesian ten¬ 
sor decomposition method with the alternating least squares method, (ii) to investigate 
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how different choices of the rank R impact the tensor decomposition for distinct image 
modalities; and (iii) to access the importance of the partition model. We considered 3 
different imaging data sets (or tensors) including (I-l) a diffusion tensor image (DTI) of 
size 90 X 96 X 96, (1-2) a white matter RAVENS map image of size 99 x 99 x 70, and 
(1-3) a Tl-weighted MRI image of size 64 x 108 x 99. We fitted models ^ and (|^ to the 
three image tensors and decomposed each of them with R = 5, 10, and 20. We consider 
27 partitions of size 30 x 30 x 32 for the DTI image, 18 partitions of size 33 x 33 x 35 for 
the RAVENS map, and 24 partitions of size 32 x 27 x 33 for the T1 image, respectively. 
The hyperparameters were chosen to reflect non-informative priors and are set as = 1, 
Uit = 10“^, and k = 10“®. 

We run steps (a.l) — (a-4) of the Gibbs sampler algorithm in Section 2.4 for 5,000 
iterations. The efficiency of the proposed algorithm is observed through trace plots for 9 
random voxels. Figure|^shows the results for the reconstructed white matter RAVENS map 
decomposed with R = 20. The proposed algorithm is efficient and has fast convergence. 
At each iteration, we computed the quantity X = sach rank 

and each partition. Subsequently, we computed the reconstructed image, defined as X, and 
the posterior mean estimate of X after a burn-in sample of 3, 000. For each reconstructed 
image X, we computed its root mean squared error, RMSE = \ \X — X\\ 2 / 

We consider the non-partition model and compare the Bayesian method and the stan¬ 


dard alternating least squares method (Kolda and Bader, 2009). The results are shown in 
Table Figure shows an axial slice of the original images and the reconstructed images 
for ranks R = 5, 10, and 20 as S' = 1. For the images considered in this study, the Bayesian 
decomposition gives a smaller RMSE for all cases. As expected, the higher the rank, the 
smaller the reconstruction error. 


3.2 A 2-dimensional image example 

The goals of the second set of simulations are to assess whether TPRM is able to capture 
regions of interest, that significantly differ between two groups, in a 2-dimensional phan¬ 
tom and to compare TRPM with the functional principal components model (fPCA). We 
generate a data set {(?/*, A)) : i = 1,... ,n} with n = 200 according to 

Hi ~ Bernoulli(0.5) and Aj = X^iyi) -|- Xj, 

where Aj, £i = (ejj^jj), and X^iyi) are 32 x 32 matrices and A'o(l) and A'o(l) — A'o(O) are, 
respectively, shown in panels (a) and (b) of Figure We independently generated eiyj 2 
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Figure 2: Trace plots in 9 randomly chosen voxels in the white matter RAVENS map 
by using Bayesian tensor decomposition with R = 20. The trace plots indicate that the 
Markov chain converges after around 1000 iterations. 

Table 1: Root mean squared error for 3 different image modalities. The Bayesian decom¬ 
position outperforms the alternating least squares in each scenario. There is a smaller error 
measurement with an increase of the rank R. 




Tl-weighted 

WM RAVENS 

DTI 

R=5 

BayesianCP 

45.3191 

1.5853 

3.1656e-004 


ALS 

45.3636 

1.6013 

3.2506e-004 


Partition 

37.3712 

1.2178 

2.0929e-004 

R=10 

BayesianCP 

41.7018 

1.4382 

2.7367e-004 


ALS 

42.4350 

1.4533 

2.8247e-004 


Partition 

31.3836 

1.0186 

1.5748e-004 

R=20 

BayesianCP 

37.1796 

1.2885 

2.2911e-004 


ALS 

38.3166 

1.3166 

2.3676e-004 


Partition 

25.1574 

0.8085 

1.1349e-004 


from a V(0,5^) generator for ji, j 2 = 1,..., 32. Panel (c) in Figure]^ shows a generated 2D 
image from a random subject in group 1, which is almost indistinguishable from random 


noise. The hyperparameters in Section 2^ are chosen to reflect non-informative priors with 
z/qt = 1 , h>ir = 10 “'^, = 10 "^, and k = 10 “^. 

We applied two TPRMs with 5 = 1 (no-partition model) and 5 = 16 to the simulated 
data set. We compared the two TPRMs with the functional principal components model 
(fPCA), in which we learned the basis functions in the hrst stage and then included the 
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Figure 3: Simulation 1 for Bayesian tensor decomposition results. In each panel, the image 
on the left represents an axial slice of the original image, with top panel representing the 
non-partition model and bottom panels the partition model. In addition, from left to right 
we have the decomposed images for ranks 1.^ = 5,10,20, respectively. Panel (a): DTI 
image; panel (b): white matter RAVENS map; and panel (c): Tl-weighted image. 





top R most important principal components as covariates in a probit regression in the 
second stage. We set R = 8 for all three models, and skip the factor model in (|^ since 
the nnmber of featnres is manageable in this simulation. For TPRMs, we run the Gibbs 
sampler algorithm for 5000 iterations with a burn-in period of 3000 iterations. 




by using MCMC 


We also computed the Bayesian estimate of "P = 
samples. The estimated quantity V represents a projection of the group differences into the 
image space. Furthermore, we used MCMC samples to construct credible intervals for V in 
the imaging space. This quantity is extremely important in neuroimaging studies since it 
allows us to precisely identify signihcant locations in the brain that are associated with the 
response variable. Panels (c), (d) and (e) of Figure]^ are the posterior mean estimates of V 
for the fPCA model, TPRM with 5 = 1, and TPRM with S = 16, respectively. Panel (f) of 
Figure]^ shows the 95% credible interval for TPRMs with S = 16. The result reveals that 
the proposed model closely recovers the true underlying location where differences between 
both groups exist. 


3.3 A 3D image example 

The goal of this set of simulations is to examine the classihcation performance of the 
partition model in the 3D imaging setting. To mimic real data, we consider scenarios 
where we have extremely noisy images. Our goal is to compare 3 models whose main 
difference is the way the features are extracted: (i) functional principal component model 
(fpca); (ii) tensor alternating least squares (tals); and (hi) partition model with tensor 
decomposition and principal components on the extracted features (pmtd). We simulated 
the three-dimensional image covariates as follows: 


Qo T coi/iTo -|- Tj, 


for i = 1,...,200, Go e 3^64X64X50 is a hxed brain template with values ranging from 0 
to 250, Co = 50, 65, and the elements of the tensor Si G were independently 


generated from a N(0, 70^) generator. Moreover, we set Xq = 


i(i) 


e 3?®4 


x4 


,( 2 ) 


e 3?®^^^, and Aq^^ G 3?®®^^, and are matrices whose (cd + j)-th 


id) 




, where 


element of each column is equal to sin(j7r/14), with Cd indicating the position at the d-th 
coordinate. 

For each value of cq, we consider S = 32 and run the probit model for 100 generated 
datasets. For each run we compute the model accuracy dehned as 1 — Vi — > 

0.5))/iV, where I{A), is the indicator function of the set A. The results are summarized 
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(a)Xo(1) 


(b)Xjj(1)-XQ(0) 



Figure 4: Results of the 2-D imaging example: (a): R'o(O); (b): R'o(l) — Ro(0); (c): the 
estimated projection V for the fPCA model. Panels (d) and (e) are the posterior mean 


of the quantity V = 


A: R 


for the non-partition model BTRM with S' = 1 
and for BTRM with S = 16, respectively. Panel (f) is the 95% credible interval of V for 
BTRM(S' = 16) revealing the true underlying location, where differences between both 


groups exist. 
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in Table Further,we consider one generated dataset for each scenario and run a 10-fold 
cross-validation procedure. Each training set has 180 subjects and 20 testing subjects. The 
prediction accuracy is then computed on the test set and a summary with the results is 
also shown in Table Our partition model outperforms the tensor model and the fpca for 
both scenarios. 

Table 2: Mean model accuracy and mean prediction accuracy. The partition model out¬ 
performs the tensor model and the fpca for both measurements and both scenarios. As 
expected, the model accuracy is higher than the prediction accuracy. 




fpca 

tals 

pmtd 

Monte Carlo 

cc = 50 

cc = 65 

0.7750 

0.8650 

0.7100 

0.745 

0.8650 

0.9318 

10-fold CV 

cc = 50 

cc = 65 

0.5900 

0.6400 

0.5050 

0.5400 

0.7500 

0.7800 


4 Real data analysis 

“Data used in the preparation of this article were obtained from the Alzheimer’s Disease 
Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched 
in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical 
Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private 
pharmaceutical companies and non-proht organizations, as a $60 million, 5-year public 
private partnership. The primary goal of ADNI has been to test whether serial magnetic 
resonance imaging (MRI), positron emission tomography (PET), other biological markers, 
and clinical and neuropsychological assessment can be combined to measure the progression 
of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). Determination of 
sensitive and specihc markers of very early AD progression is intended to aid researchers 
and clinicians to develop new treatments and monitor their effectiveness, as well as lessen 
the time and cost of clinical trials. The Principal Investigator of this initiative is Michael 
W. Weiner, MD, VA Medical Center and University of California, San Francisco. ADNI 
is the result of efforts of many coinvestigators from a broad range of academic institutions 
and private corporations, and subjects have been recruited from over 50 sites across the 
U.S. and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been 
followed by ADNI-GO and ADNI-2. To date these three protocols have recruited over 
1500 adults, ages 55 to 90, to participate in the research, consisting of cognitively normal 
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older individuals, people with early or late MCI, and people with early AD. The follow up 
duration of each group is specihed in the protocols for ADNI-1, ADNI-2 and ADNI-GO. 
Subjects originally recruited for ADNI-1 and ADNI-GO had the option to be followed in 
ADNI-2. For up-to-date information, see www.adni-info.org.” 

We applied the proposed model to the anatomical MRI data collected at the baseline 
of ADNI. We considered 402 MRI scans from ADNIl, 181 of them were diagnosed with 
AD, and 221 healthy controls. These scans were performed on a 1.5 T MRI scanners using 
a sagittal MPRAGE sequence and the typical protocol includes the following parameters: 
repetition time (TR) = 2400 ms, inversion time (TI) = 1000 ms, flip angle = 8o, and held 
of view (FOV) = 24 cm with a 256 x 256 x 170 mm^ acquisition matrix in the x, y, and z 


dimensions, which yields a voxel size of 1.25 x 1.26 x 1.2 mm'^ (Huang et al., 2014) 


The Tl-weighted images were processed using HAMMER (Hierarchical Attribute Match¬ 
ing Mechanism for Elastic Registration), a free pipeline. The processing steps include skull 
and cerebellum removal, followed by tissue segmentation to identify the regions of white 
matter (WM), gray matter (GM) and cerebrospinal huid (GSF). Then, registration was 
performed to warp the subject to the space of the Jacob template (size 256 x 256 x 256 
mm^). Finally, a RAVENS map was calculated for each subject. The RAVENS method¬ 
ology precisely quantihes the volume of tissue in each region of the brain. The process 
is based on a volume-preserving spatial transformation that ensures that no volumetric 


information is lost during the process of spatial normalization (Davatzikos et ah, 2001). 


4.1 Selecting the partition model 

Following the pre-processing steps, we downsampled the images, cropped them, and ob¬ 
tained images of size 96 x 96 x 96 mm^. We then considered: 1) 64 partitions of size 
24 X 24 X 24 mm^; 2) 512 partitions of size 12 x 12 x 12 mm^; and 3) 4096 partitions of 
size 6x6x6 mm^. For different values of R, we selected the number of partitions based 
on the prediction accuracy of a 10-fold cross validation with the following steps. First, 
we extracted the features by tensor decomposition for different values of rank R. Second, 
to reduce the dimension of the extracted feature matrix, we applied a factor model with 
hxed rank K = 100. Third, we run 5000 iterations of the Bayesian probit model with the 


mixture prior described in Section |2^ with a burn-in of 3000 samples. Finally we computed 
the mean prediction accuracy in each model. Results are shown in Table 
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Table 3: Mean prediction accnracy for a 10-fold cross-validation procednre. There is a 
smaller error measnrement with an increase of the rank R. 


Partition size 

24 X 24 X 24 

12 X 12 X 12 

6x6x6 

R = 5 

— 

— 

0.7813 

R=10 

0.6714 

0.7311 

0.7613 

R = 20 

0.6937 

0.7587 

0.7588 

R = 30 

0.5770 

0.6544 

— 

No partition 

1x1x1 



R = 100 

0.7498 

— 

— 


4.2 Final analysis based on the selected model 

Based on the prediction accnracy, we selected the model with partitions of size 6x6x6 


mm'^ and R = 5. For the selected model, we rnn the model described on Section 2.2 


with hyperparameters chosen to reflect non-informative priors with z/qt = 1 , = 10“'^, 

= 10"^, K = 10“'^, = 10“® and (3ir^ = 10“®. In the first screening procednre, we 

eliminate the partitions whose featnres, extracted from the tensor decomposition, were zero 
since they may be not relevant to predict AD outcome. From the 4096 original partitions, 
only 1695 passed the first screening. Figure shows the correlation between the features 
extracted in the hrst screening step; panel a shows the correlations for all 8475 features 
and panel b gives a zoom in the hrst 200 correlations. Inspecting the hgure, we observe a 
very high correlation of features within the partitions and between nearby partitions. That 
justihes the need of the factor model in Equation (|^. 

Finally, we ran the Gibbs sampler algorithm described in Section 2.4 for 5, 000 iterations 
with a burn-in period of 3, 000 iterations. First, we compute a 95% credible interval for 
the coefficients b and conclude 36 features are important to predict AD outcome. Figure 
shows the 5 most important bases projected into the image space. The importance is given 
by the absolute value of the posterior mean in each one of the 36 selected features. 

Second, we let P = Db, we computed the posterior mean of the projection V = 
||A; P||. In addition, we compute a 95% credible interval for V. Figure]^ 

shows the results. Inspecting Figure reveals the regions of the biomarkers selected to 
predict the AD outcome. 

To hnd out specific locations in the brain responsible for predicting AD outcome, we 
label the locations indicated on the right panel of Figure considering the Jiilich atlas 
(Eickhoff et ah, 2005). The largest biomarker is found in the white matter region known 
as cingulum, specifically in the posterior cingulate cortex, as shown in Table Appendix 
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Correlation matrix for all features 



2000 4000 6000 8000 



Figure 5: Left panel shows the results for the correlation of the columns of the matrix of 
features L obtained in a hrst screening procedure. The right panel shows the same graphic 
with the hrst 200 features. We observe a high correlation between features of the same 
partitions and neighboring partitions. 



Figure 6: Panel (a) shows a 3D rendering of the Ravens map of a random selected control 
patient. Panels (b)-(f) show a 3D rendering of the 5 most important bases projected into 
the image space. The importance is given by the absolute value of the posterior mean in 
each one of the 36 selected features. 
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[Xj The posterior cingulate cortex is a limbic lobe that seem to be involved early and 
consistently in AD (Vemuri and Jack Jr, 2010). The next biomarker is a region known 
by the fornix, which is white-matter tract linking the hippocampus to several subcortical 
and cortical regions. Given that these areas are important for successful learning and 
memory, their ability to communicate with one another via the fornix may also be critical for 


performance in tasks challenging these cognitive domains (Postans et ah, 2014). Important 
biomarkers were also found within the gray matter tissue, including the posterior parietal 
lobe (Table |^. The parietal lobe has been linked to neuropsychological dehcits in AD 
likely due to its strong connectivity to other brain areas, and to the wide range of cognitive 
functions relying on parietal lobe functioning ( [Jacobs et al. , 2012). The most prominent 
biomarkers were the superior parietal cortex (Table |^. This region is important for spatial 
processing, selective attention, and spatial and non-spatial working memory (Jacobs et al. 


2012). Another important biomarker found is the hippocampus which is associated with 


learning and consolidation of explicit memories from short-term memory to cortical memory 
storage for the long term (Campbell and MacQueen, 2004). Previous studies have shown 
that this region is particularly vulnerable to Alzheimer’s disease pathology and already 


considerably damaged at the time clinical symptoms hrst appear (Schuff et al., 2009; Braak 


and Braak, 1998). Other important biomarkers found by TPRM are shown in Table 


Appendix [A} 



Figure 7: Left panel shows the results for the absolute value of the projection V for the 
ADNI dataset. Colors on the right side of the colorbar indicate regions where differences 
are higher between the control group and the group diagnosed with Alzheimer’s disease. 
The panel on the right shows a 95% credible interval for V. The colored parts indicate the 
regions of the biomarkers selected to predict the onset of AD. 
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5 Discussion 


We have proposed a Bayesian tensor partition regression model (TPRM) to establish an 
association between imaging tensor predictors and clinical outcomes. The ultra-high dimen¬ 
sionality of imaging data is dramatically reduced by using the proposed partition model. 
Our TPRM efficiently addresses some key features of neuroimaging data: relatively low 
signal to noise ratio, spatially clustered effect regions, and the tensor structure of imaging 
data. 

Our simulation studies showed a great performance of the TPRM when predicting a 
binary outcome from images in 2D and 3D. The TPRM provided much higher prediction 
accuracy when compared to both, the traditional funtional PCA and a tensor model that 
does not include partitions. In addition, the TPRM is able to capture regions of interest 
that differ between two groups in scenarios where the fPCA cannot perform well. 

Our application to the ADNI dataset showed that our TPRM is able to efficiently 
reduce and identify relevant biomarkers to predict Alzheimer’s disease outcome, overcoming 
challenges posed by the complexity of the imaging data such as ultra-high dimensionality 
and multicolinearity of features. 
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A Data analysis supporting material 


Table 4: Biomarkers that are relevant to predict AD outcome, based on the Jiilich atlas. 


Region 

of signifcant voxels 

WM Cingulum R 

6547 

WM Fornix 

4193 

GM Premotor cortex BA6 R 

4186 

WM Superior longitudinal fascicle R 

3276 

GM Visual cortex VI BA17 R 

2449 

GM Superior parietal lobule 7M L 

1896 

GM Visual cortex V2 BA18 L 

1549 

GM Superior parietal lobule 7A R 

1454 

GM Visual cortex V2 BA18 R 

1440 

GM Primary somatosensory cortex BA3a R 

1436 

GM Inferior parietal lobule PFm R 

1310 

WM Corticospinal tract L 

1299 

GM Inferior parietal lobule PF R 

1062 

GM Superior parietal lobule 7P R 

922 

GM Primary motor cortex BA4p R 

908 

GM Superior parietal lobule 5Ci R 

900 

WM Superior occipito-frontal fascicle R 

889 

GM Hippocampus cornu ammonis R 

827 

WM Acoustic radiation R 

807 


B Gibbs sampling algorithm for TPRM 


We provide the Gibbs sampling algorithm to sample from the posterior distribution (10) 


in Section 2A It involves sampling from a series of conditional distributions, while each 
of the modeling components is updated in turn. As an illustration, we divide the whole 
image into S equal sized regions and assume Ui ~ Bernoulh(/ri) with the link function h(-) 


being the probit function. By following Albert and Chib (1993), we introduec a normally 


distributed latent variable, tc,, such that Wi ~ 1); yi = l{wi > 0), where l(-) is an 

indicator function of an event. 

The complete Gibbs sampler algorithm proceeds as follows. 
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(a.O) Generate w = {wi, ■ ■ ■ , WnY from 


Wi\yi = 0 ~ l{wi < 0)N(zf7 + 1), 

S=1 

Wilvi = 1 ~ l{wi > 0)N(zf7 + 1). 


S=1 


(a.l) Update r(s) from its full conditional distribution 

D 

r(s)|-Gamma(z/o^ + {N\{pd)/2, + (1/2) ^ 


d=i 


JD 


where (s) = {dfW - ||A(^); A«, A^^),..., Af), 

(a.2) Update {A^f^}j^r from its full conditional distribution given by 

q-s \ 


Oil---JD 


{Af} 


Jdl’l 


N 


T(s){i,7,1,-J/Zp,' +G I • 


where = HAI*'; .... A*''-*', A7‘>, .... Af>, i<‘>||, yi,, is given by 

A''*>-||A<*>iAy,Af,...,A7,i<”>||+||A'*'i{A«},,„{A®},„...,{A7},,„{L!‘'K, 

and A,7)^ is a subtensor fixed at the entry jd along the d-th dimension of 7-r)- 
(a.3) Update {Ls}ir from its full conditional distribution given by 

T(S){XI%I-) 


{L,)i 


... rs^ 


N 


t{s){X%X^) + N 


,{t{s){X\T) + N) 


-1 


where X^ = ||..., A^^||, is the same as above, and is a subtensor 
hxed at the Ath entry along the subject dimension of 

(a.4) Update A*^^) from its full conditional distribution 


AW|-N 


r(5)(r^Ap),/:-) 
t{s){C^, C^) + K 


,{T{s){C^,C) + n) 


-1 


where U* = Hi/?; A<‘',...,Af',L'*>|| and lij is a vector of ones of size R. 
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(a.5) Update Q). from its full conditional distribution 

/ - 
Qk\ ~ ^g')i ^9 I “1“ U/> ^ I and fig T,fp^g ^ ' df^jlj , 

V i=i / i=i 

where l*~^ = L- Gdj + dkjQk for j = 1,..., P^- 
(a.6) Update dkj for j = 1,..., from its full conditional distribution 

dkj\- ~ N 

where = (l + Ef=i • 

(a.7) Update 'ipij from its full conditional distribution 

iJijl -Gamma (/^ov- + NPl/2,I3i^ + {L*^L*)/2^ , 

where L* = L - GD. 

(a.8) Update 5k from its full conditional distribution 

4 ~ bernoulli(pi/pi +po), 

where pi = 7rexp{ —(l/2(T^)(hfc)^} and po = 7rexp{ —(l/2e)6^}. 

(a.9) Update h from its full conditional distribution 


414 = 1 ~ ^{Y,Wigik/Y.9lk + iY^dlk + 

i i i 

414 = 0 ~ + l/e, {Y^alk + l/e)“^), 


where wf ^ = Wi - zjj - Ef'=i 
(a. 10) Update tt from its full conditional distribution 


7r| • • • rsj beta(Q;o7r 

5,r 5,r 
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(a. 11) Update 7 from its full conditional distribution 

7| ■ ■ — N (s;-'(i;7* + ZX) . ’ 

where S* = vlq + Z'^Z and =tc* = w — G^b. 

(a. 12) Update v from its full conditional distribution 

n| • • • ~ Gamma + g/2, + ( 7 ^ 7 ) 72 ^ . 


All the tensor operations described in steps (a.l) — (a.4) can be easily computed using 


Bader et ah (2012), http://www.sandia.gOv/~tgkolda/TensorToolbox/index-2.5.html, 
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